Give the session info (reduced).
## [1] "R version 3.6.3 (2020-02-29)"
## [1] "x86_64-pc-linux-gnu"
Load packages. If they are not installed yet on your local machine, use install.packages() to install them.
library(readr)
library(MASS)
library(tidyr)
library(ggplot2)
library(ggridges)
library(scales)
library(plyr)
library(rcompanion)
library(rstatix)
Give the package versions.
## rstatix rcompanion plyr scales ggridges ggplot2 tidyr
## "0.6.0" "2.3.26" "1.8.6" "1.1.1" "0.5.3" "3.3.3" "1.1.2"
## MASS readr
## "7.3-53" "1.4.0"
Load results of all participants in the UD track directly from github repo by using the “readr” library. We here only use Track B as we have measures from different domains (morphology, syntax) in this track, while in Track A we have mainly morphological complexity measures.
#TRACK B (Universal Dependencies, UD)
brunato.results <- read_csv("https://raw.githubusercontent.com/IWMLC/language-complexity-metrics/master/UDtrack/Brunato_venturi/Brunato-Venturi.csv")
coltekin.results <- read_csv("https://raw.githubusercontent.com/IWMLC/language-complexity-metrics/master/UDtrack/Coltekin_rama/coltekin.csv")
semenuks.results <- read_csv("https://raw.githubusercontent.com/IWMLC/language-complexity-metrics/master/UDtrack/Semenuks/Semenuks.csv")
sinnemaki.results <- read_csv("https://raw.githubusercontent.com/IWMLC/language-complexity-metrics/master/UDtrack/Sinnemaki/Sinnemaki.csv")
sozinova.results <- read_csv("https://raw.githubusercontent.com/IWMLC/language-complexity-metrics/master/UDtrack/Sozinova_etal/sozinova_etal.csv")
Load file with meta-information on the measures. This is slightly adapted from a file prepared by Katharina Ehret and Alice Blumenthal-Dramé in the context of the Interactive Workshop on Measuring Language Complexity (IWMLC 2019).
meta.info <- read_csv("https://raw.githubusercontent.com/IWMLC/language-complexity-metrics/master/measures_metaInfo.csv")
Create a single data frame merging results together using the Reduce() and merge() functions of base R.
results <- Reduce(merge, list(brunato.results, coltekin.results, semenuks.results, sinnemaki.results, sozinova.results)) # by default the merge() function uses columns with the same names to merge on, i.e. "id" and "language" in our case.
Give some simple statistics for this data frame of results.
nrow(results) # i.e. number of different UD corpora
## [1] 63
length(unique(results$language)) # i.e. number of different languages
## [1] 44
unique(results$language) # output the language names
## [1] "Afrikaans" "Arabic" "Bulgarian"
## [4] "Catalan" "Czech" "Old Church Slavonic"
## [7] "Chinese" "Danish" "German"
## [10] "Greek" "Estonian" "English"
## [13] "Basque" "Persian" "Finnish"
## [16] "French" "Gothic" "Ancient Greek"
## [19] "Hebrew" "Hindi" "Croatian"
## [22] "Hungarian" "Indonesian" "Italian"
## [25] "Japanese" "Korean" "Latin"
## [28] "Latvian" "Dutch" "Norwegian"
## [31] "Polish" "Portuguese" "Romanian"
## [34] "Russian" "Slovak" "Slovenian"
## [37] "Spanish" "Serbian" "Swedish"
## [40] "Turkish" "Uyghur" "Ukrainian"
## [43] "Urdu" "Vietnamese"
ncol(results)-2 # i.e. number of complexity measures
## [1] 27
Invert the values (by substracting them from 1) for the measure “CR_inflection_accurracy”. Note that higher values in the original measure mean lower rather than higher complexity.
results$CR_inflection_accuracy <- 1-results$CR_inflection_accuracy
Center and scale all numerical columns to make them more comparable.
results.scaled <- cbind(results[1:2], scale(results[3:29])) # add meta-information columns again
Transform data frame from wide format to long format (this is necessary for later plotting and analyses by columns rather than rows).
results.long <- gather(results.scaled, key = measure, value = value, BV_n_tokens:SBS_DER)
Remove rows which contain NAs.
results.long <- results.long[complete.cases(results.long), ]
nrow(results.long)
## [1] 1544
Add meta-information on complexity measures.
results.long <- merge(results.long, meta.info[, 1:3], by = "measure")
Generate data frames for subdomains (morphology, syntax). This can be used for comparison with aggregated results.
results.morph <- results.long[results.long$domain == "morphology", ]
results.syntax <- results.long[results.long$domain == "syntax", ]
Plot density distributions of complexity measurements by language. Individual measurements for each complexity measure (and different corpora if available) are plotted as dots. The central value (0) is indicated by a vertical dotted line for visual reference.
Create data frame with both domains + “overall”, this is necessary for the plots below.
results.overall <- results.long
results.overall$domain <- rep("overall", times = nrow(results.long))
results.plot <- rbind(results.overall, results.long)
Plot density distributions with indication of median (mean) values.
density.plot <- ggplot(results.plot, aes(x = value, y = domain, fill = domain)) +
geom_vline(aes(xintercept = 0), color = "darkgrey") +
stat_density_ridges(quantile_lines = TRUE, quantiles = 0.5,
rel_min_height = 0.01, scale = 0.6, alpha = 0.5,
jittered_points = TRUE, point_shape = "o", point_size = 1,
size = 0.25, position = position_points_jitter(height = 0.1)) +
#geom_vline(data = med, aes(xintercept = grp.median), linetype = "dashed") +
facet_wrap(~ language) +
xlim(-5, 5) +
labs(x = "Complexity Value", y = "Density") +
theme_bw() +
theme(legend.position = "none")
print(density.plot)
## Warning: Removed 4 rows containing non-finite values (stat_density_ridges).
Safe figure to file.
ggsave("~/Github/ComplexityMetaAnalyses/Figures/EquiCompl/density_plot.pdf", density.plot,
dpi = 300, scale = 1, width = 12, height = 10, device = cairo_pdf)
Run statistical tests to assess whether the aggregate complexity distributions for certain pairs of languages differ signficantly from one another. Note that for this sample of 44 languages there are 44*(44-1)/2 = 946 possible pairwise comparisons.
Get mean, median, and standard deviation values.
# get median value for each language
med <- ddply(results.long, "language", summarise, grp.median = median(value))
# get mean value for each language
mu <- ddply(results.long, "language", summarise, grp.mean = mean(value))
# get standard deviation for each language
sdev <- ddply(results.long, "language", summarise, grp.sdev = sd(value))
Give an overview of languages with highest and lowest median complexity values.
stats.df <- cbind(mu, med[, 2], sdev[, 2])
colnames(stats.df) <- c("language", "mu", "med", "sdev")
stats.df.sorted <- stats.df[order(-stats.df$med),]
# round values to two decimal places, the "-1" excludes column 1
stats.df.sorted[, -1] <- round(stats.df.sorted[, -1], 2)
print(stats.df.sorted)
## language mu med sdev
## 21 Hungarian 0.39 0.46 1.36
## 18 Greek 0.28 0.43 0.60
## 33 Romanian 0.33 0.43 0.73
## 42 Urdu 0.44 0.43 1.56
## 2 Ancient Greek 0.35 0.43 0.83
## 4 Basque 0.19 0.42 0.70
## 8 Croatian 0.28 0.30 0.46
## 7 Chinese 0.13 0.20 1.68
## 16 German 0.11 0.20 0.97
## 35 Serbian 0.20 0.18 0.47
## 1 Afrikaans 0.13 0.13 1.08
## 30 Persian 0.06 0.10 1.54
## 37 Slovenian 0.16 0.07 0.49
## 29 Old Church Slavonic -0.03 0.03 1.15
## 22 Indonesian -0.08 0.02 1.44
## 17 Gothic -0.05 0.01 1.00
## 19 Hebrew 0.11 0.00 0.69
## 40 Turkish 0.08 -0.01 1.40
## 36 Slovak -0.30 -0.05 0.89
## 26 Latin 0.09 -0.06 0.92
## 31 Polish -0.32 -0.06 1.09
## 9 Czech 0.02 -0.07 0.66
## 34 Russian 0.09 -0.07 0.60
## 20 Hindi 0.22 -0.07 1.19
## 10 Danish -0.06 -0.09 0.34
## 27 Latvian 0.03 -0.10 0.66
## 5 Bulgarian -0.19 -0.10 0.57
## 6 Catalan 0.35 -0.12 0.93
## 3 Arabic 0.29 -0.12 1.68
## 41 Ukrainian 0.05 -0.12 0.54
## 43 Uyghur 0.51 -0.13 2.29
## 39 Swedish -0.14 -0.16 0.39
## 38 Spanish 0.18 -0.21 0.75
## 32 Portuguese 0.02 -0.22 0.61
## 28 Norwegian -0.22 -0.22 0.42
## 15 French -0.12 -0.25 0.69
## 12 English -0.22 -0.25 0.59
## 23 Italian -0.10 -0.27 0.70
## 24 Japanese -0.42 -0.31 1.27
## 13 Estonian -0.08 -0.36 0.82
## 14 Finnish -0.14 -0.46 1.06
## 11 Dutch -0.44 -0.53 0.90
## 25 Korean -0.29 -0.64 1.73
## 44 Vietnamese -0.48 -0.65 1.20
Output data frame as csv file.
write.csv(stats.df.sorted, file = "~/Github/ComplexityMetaAnalyses/Tables/descriptiveStats.csv", row.names = F)
The assumption that the tested data stems from a normally distributed population is often necessary for the mathematical proofs underlying standard statistical techniques. We might apply normality tests to check for this assumption (e.g. Baayen 2008, p. 73), but some statisticians advice against such pre-tests, since they are often too sensitive (MacDonald 2014, p. 133-136, Rasch et al. (2020), p. 67). In fact, Rasch et al. (2020, p. xi) argue based on earlier simulation studies that almost all standard statistical tests are fairly robust against deviations from normality. However, it is still advisable to check for gross deviations from normality in the data. One common way of doing this is quantile-quantile plots. The points should here roughly follow a straight line (Crawley 2007, p. 281).
ggplot(results.long, aes(sample = value)) +
stat_qq()
Select a statistical test: Standard t-tests can be used to assess significant differences in the means of the pseudo-complexity distributions, if we assume that the underlying population distributions are normal. Wilcoxon tests are a non-parametric alternative, i.e. they do not make assumptions about the underlying population distribution, e.g. normality (Crawley 2007, p. 283; Baayen 2008, p. 77). Since there are some deviations from normality visible in the QQ-Plot above, we here choose a Wilcoxon test. If we supply two data vectors, then by default the command wilcox.test() runs a Wilcoxon rank sum test (for unpaired samples), and with “paired = T” a Wilcoxon signed rank test. Note that while the same measurement procedures were used here across different languages (and corresponding text samples) to assess different types of complexity, there are sometimes NAs in the original data. We thus have to consider the resulting vectors “unpaired”. Note that pairwise.wilcox.test() is a function of the R-core stats package which runs multiple tests, i.e. for all groups in the “language” column in our case.
P-value adjustment for multiple comparisons: In case of multiple testing, we should account for the fact that the likelihood of finding a significant result by chance increases with the number of statistical tests. One of the most conservative methods to account for this is the so-called Bonferroni correction, i.e. multiplying the p-values with the number of tests. This method is appropriate when tests are independent of one another (MacDonald 2014, p. 254-260). Since we here run pairwise tests by languages, our tests are not independent (the same language is tested against others multiple times). We therefore apply the so-called Holm-Bonferroni method, which is less conservative. It does not assume independence between tests (see the descriptions in the vignette invoked by the command “?p.adjust()”).
# we add some random noise here to the value vector with
# the function jitter(), since we otherwise get warnings due to ties in the data
p.values <- pairwise.wilcox.test(jitter(results.long$value), results.long$language,
paired = F, p.adjust.method = "holm")
# get dataframe with pairs of languages where p < 0.05
m <- p.values$p.value
indices <- which(m < 0.05, arr.ind = T)
v1 <- rownames(m)[indices[, "row"]]
v2 <- colnames(m)[indices[, "col"]]
df <- cbind(v1, v2)
print(df)
## v1 v2
## [1,] "Romanian" "Dutch"
## [2,] "Romanian" "English"
## [3,] "Romanian" "Norwegian"
p.values.morph <- pairwise.wilcox.test(jitter(results.morph$value), results.morph$language,
paired = F, p.adjust.method = "holm") # we add some random noise here to the value vector with
# the function jitter(), since we otherwise get warnings due to ties in the data
# get dataframe with pairs of languages where p < 0.05
m <- p.values.morph$p.value
indices <- which(m < 0.05, arr.ind = T)
v1 <- rownames(m)[indices[, "row"]]
v2 <- colnames(m)[indices[, "col"]]
df <- cbind(v1, v2)
print(df)
## v1 v2
## [1,] "English" "Ancient Greek"
## [2,] "Italian" "Ancient Greek"
## [3,] "Spanish" "Ancient Greek"
## [4,] "English" "Basque"
## [5,] "Italian" "Basque"
## [6,] "Polish" "Catalan"
## [7,] "English" "Czech"
## [8,] "French" "Czech"
## [9,] "Italian" "Czech"
## [10,] "Japanese" "Czech"
## [11,] "Spanish" "Czech"
## [12,] "Polish" "Dutch"
## [13,] "Finnish" "English"
## [14,] "Polish" "English"
## [15,] "Russian" "English"
## [16,] "Slovak" "English"
## [17,] "Polish" "French"
## [18,] "Polish" "Italian"
## [19,] "Polish" "Japanese"
## [20,] "Spanish" "Polish"
p.values.syntax <- pairwise.wilcox.test(jitter(results.syntax$value), results.syntax$language,
paired = F, p.adjust.method = "holm") # we add some random noise here to the value vector with
# the function jitter(), since we otherwise get warnings due to ties in the data
m <- p.values.syntax$p.value
indices <- which(m < 0.05, arr.ind = T)
v1 <- rownames(m)[indices[, "row"]]
v2 <- colnames(m)[indices[, "col"]]
df <- cbind(v1, v2)
print(df)
## v1 v2
## [1,] "Finnish" "Catalan"
## [2,] "Polish" "Catalan"
## [3,] "Slovak" "Catalan"
## [4,] "Finnish" "Chinese"
## [5,] "Polish" "Chinese"
## [6,] "Spanish" "Czech"
## [7,] "Finnish" "English"
## [8,] "Polish" "English"
## [9,] "Hebrew" "Finnish"
## [10,] "Hungarian" "Finnish"
## [11,] "Indonesian" "Finnish"
## [12,] "Romanian" "Finnish"
## [13,] "Spanish" "Finnish"
## [14,] "Spanish" "Norwegian"
## [15,] "Romanian" "Polish"
## [16,] "Spanish" "Polish"
## [17,] "Spanish" "Slovak"
Statistical significance is only one part of the story. For instance, a difference in complexity values might be statistically significant, but so small that it is negligible for any theorizing. In fact, it is sometimes argued that effect sizes - rather than p-values - should be the aim of statistical inquiry (Cahusac 2020, p. 12-15). An overview of effect size measures per statistical test is given in Patil (2020). In conjunction with the Wilcoxon signed rank test we here use the statistic r (i.e. function wilcox_effsize() of the “rstatix” package).
effect.sizes <- wilcox_effsize(results.long, value ~ language, paired = F)
#print(effect.sizes)
effect.sizes.morph <- wilcox_effsize(results.morph, value ~ language, paired = F)
#print(effect.sizes.morph)
effect.sizes.syntax <- wilcox_effsize(results.syntax, value ~ language, paired = F)
#print(effect.sizes.syntax)
Plot a heatmap with effect sizes to get a better overview.
effect.sizes.plot <- ggplot(as.data.frame(effect.sizes), aes(group1, group2)) +
geom_tile(aes(fill = effsize), color = "white") +
scale_fill_gradient2(low = "light blue", mid = "light grey", high = "red",
midpoint = 0.5, limit = c(0,1)) +
geom_text(aes(label = round(effsize, 2))) +
labs(x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
effect.sizes.plot
Safe figure to file.
#ggsave("~/Github/ComplexityMetaAnalyses/Figures/EquiCompl/effectSize_plot.pdf", effect.sizes.plot,
# dpi = 300, scale = 1, width = 20, height = 20, device = cairo_pdf)
effect.sizes.plot <- ggplot(as.data.frame(effect.sizes.morph), aes(group1, group2)) +
geom_tile(aes(fill = effsize), color = "white") +
scale_fill_gradient2(low = "light blue", mid = "light grey", high = "red",
midpoint = 0.5, limit = c(0,1)) +
geom_text(aes(label = round(effsize, 2))) +
labs(x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
effect.sizes.plot
effect.sizes.plot <- ggplot(as.data.frame(effect.sizes.syntax), aes(group1, group2)) +
geom_tile(aes(fill = effsize), color = "white") +
scale_fill_gradient2(low = "light blue", mid = "light grey", high = "red",
midpoint = 0.5, limit = c(0,1)) +
geom_text(aes(label = round(effsize, 2))) +
labs(x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
effect.sizes.plot
We find very little evidence for systematic differences (i.e. statistically significant differences with moderate to high effect sizes) in the aggregate complexities of the 44 languages investigated here. In fact, there are only three pairs of languages out of 946 for which we find statistically significant location shifts in the overall complexity distributions, and moderate effect sizes: Dutch and Romanian, English and Romanian, Romanian and Norwegian. For the rest of the pairs there is no evidence of a significant and considerable location shift.
So, given the data, methods, and languages in this meta study, there is overboarding evidence that languages are equi-complex, if different dimensions of complexity are aggregated.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction using statistics in R. Cambridge University Press.
Cahusac, P. M. B. (2021). Evidence-based statistics. John Wiley & Sons.
Crawley, M. J. (2007). The R book. John Wiley & Sons Ltd.
Ehret, K. and B. Szmrecsanyi (2016). An information-theoretic approach to assess linguistic complexity. In: R. Baechler & G. Seiler (eds.), Complexity, Isolation, and Variation, 71-94. Berlin: de Gruyter.
Kruschke, J. K. (2012). Bayesian estimation supersedes the t test. Journal of Experimental Psychology.
McDonald, J.H. (2014). Handbook of Biological Statistics (3rd ed.). Sparky House Publishing, Baltimore, Maryland. online at http://www.biostathandbook.com
Patil, I. (2020). Test and effect size details. online at https://cran.r-project.org/web/packages/statsExpressions/vignettes/stats_details.html.
Rasch, D., Verdooren, R., and J. Pilz (2020). Applied statistics. Theory and problem solutions with R. John Wiley & Sons Ltd.